Data Analysis
Why?
Data analysis is at the core of data science, enabling professionals to extract insights, identify patterns, and make evidence-based decisions. This course equips you with practical tools and techniques to analyze structured and unstructured data from various sources, preparing you for real-world applications such as predictive modeling, trend analysis, and text mining. Similarity measures in particular appear in many algorithms and fields, for example clustering, recommendation systems, k-nearest neighbors (KNN), and retrieval-augmented generation (RAG).
What?
This course covers foundational and advanced topics in data analysis, from data collection and preprocessing to the application of analytical models. You will learn about web data extraction, file format handling, time series, text mining techniques, and similarity measures.
Curriculum:
Introduction to Data Analysis
Overview of the data analysis process, key concepts in data science, types of data, data cleaning basics, and exploratory data analysis (EDA) techniques using visual and numerical summaries.
Data Scraping and Web Data Collection
Techniques for extracting data from websites and online platforms, understanding APIs, using tools like BeautifulSoup or Scrapy, and addressing ethical considerations in web scraping.
XML and JSON Data Formats
Understanding the structure of XML and JSON files, parsing techniques in Python, transforming nested data, and integrating data from external web APIs and files.
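As a small illustration of transforming nested data, the sketch below parses a JSON string with Python's standard `json` module and flattens it into dotted keys. The `flatten` helper and the sample record are made up for this example and are not course material.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into one dict with dotted keys (hypothetical helper)."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated prefix.
        flat[prefix.rstrip(".")] = obj
    return flat

# Made-up sample record, as it might arrive from a web API.
raw = '{"user": {"name": "Ada", "tags": ["math", "code"]}, "active": true}'
record = flatten(json.loads(raw))
print(record)
```

A flat record like this is easier to load into a table, at the cost of encoding list positions in the key names.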
Time Series Analysis
Analyzing temporal data, identifying trends and seasonality, time series decomposition, forecasting methods like ARIMA, and applications in business and finance.
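One simple way to see a trend is to smooth the series with a moving average. The sketch below uses plain Python and invented numbers; it is only a smoothing step, not a full decomposition or an ARIMA model.

```python
def moving_average(series, window):
    """Return the simple moving average over each full window of the series."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Invented monthly values with an upward trend plus some noise.
sales = [10, 12, 14, 13, 15, 17, 16, 18, 20]
trend = moving_average(sales, window=3)
print(trend)
```

A larger window gives a smoother trend line but reacts more slowly to real changes in the data.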
Sentiment Analysis
Using natural language processing to classify text sentiment, understanding lexicon-based vs. machine learning approaches, and applying sentiment models to product reviews, tweets, or news articles.
Text Analysis
Techniques for preprocessing textual data (tokenization, stemming, stopword removal), feature extraction methods like TF-IDF, and identifying patterns or topics within large text corpora.
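To make the preprocessing and TF-IDF steps concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, a tiny illustrative stopword list, no stemming, and the basic weighting tf × log(N / df); libraries such as scikit-learn use slightly different variants of the formula.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "on", "and"}  # tiny illustrative list, not a real one

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords (no stemming here)."""
    return [word for word in text.lower().split() if word not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up mini corpus.
docs = ["the cat sat on the mat", "the dog sat on the log", "the cat chased the dog"]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "mat" gets a higher weight than "cat" because it appears in only one document, which is the intuition behind the inverse document frequency term.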
Regression Models
Review of linear regression, introduction to multiple regression, model evaluation techniques, assumptions checking, and using regression in predictive analytics.
Dynamic Programming and Edit Distance
Understanding dynamic programming as an algorithmic strategy, computing edit distance between strings, applications in spell checking, plagiarism detection, and bioinformatics.
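A minimal sketch of edit distance (Levenshtein distance) computed with dynamic programming, assuming unit cost for insertion, deletion, and substitution. It keeps only two rows of the table at a time to save memory.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

Each cell depends only on its left, upper, and upper-left neighbors, which is why the problem fits dynamic programming: every subproblem is solved once and reused.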
Similarity
Measuring similarity between text documents or data vectors using cosine similarity, Jaccard index, Euclidean distance, and understanding their roles in clustering and recommendation systems.
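The three measures named above can be sketched in a few lines of plain Python. The function names and the handling of edge cases (zero vectors, two empty sets) are choices made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def jaccard_index(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

print(cosine_similarity([1, 2], [2, 4]))
print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
print(euclidean_distance([0, 0], [3, 4]))               # 5.0
```

Note that cosine similarity and the Jaccard index grow as items become more alike, while Euclidean distance shrinks, so they cannot be swapped into an algorithm without flipping the comparison.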
Vector Space Model
Representing documents in vector space, applying TF-IDF weighting, calculating similarity, and using the model in information retrieval and ranking tasks.
Notes
Don’t worry about knowing everything at once. Focus on understanding the logic behind each method and how it’s used in real-world problems. Small projects or side exercises can really help solidify the concepts.